This notebook is one of the mandatory deliverables when you submit your solution. Its structure follows the WDL evaluation criteria and it has dedicated cells where you should add information. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work. Make sure to list all the datasets used besides the ones provided.
Optimization of Soft-Mobility Drop-off Points
Theory
As described in the challenge's introduction, Porto suffers from heavy traffic, and e-scooters might help alleviate this problem. However, one first needs to understand who the scooter users are:
Since the core demographic of point 1 is likely to hold a subsidized monthly bus/tram ticket, bus/tram is probably the primary means of transportation for these users. Our hypothesis is therefore that scooters are not a replacement for bus/tram but a complement. This hypothesis can be checked by determining the mean/median trip length, which we do in section EDA, subsection E-Scooter OD. See more details in Models 1 and 2 below.
Furthermore, we hypothesize that scooters are used either at the start of a trip - followed by bus/metro - or at the end of it.
Datasets
We have used the original Bus and Metro GTFS datasets, as well as the e-scooter ride and park datasets. As a complement, we have also included census data from the Portuguese National Institute of Statistics (INE) and amenity data from OpenStreetMap.
Analysis
Assuming that the e-scooters are a complement to the current public transportation network, we decided to analyze which were the busiest bus and metro stops and used these as hotspots (i.e., points that likely require a scooter park) in our model. We first attempted to use passenger ticket validation data (metro and bus) from the third challenge. This, however, proved ineffective because the validation data is from 2020, which was strongly disrupted by the Covid-19 lockdowns and particularly affected our target demographic - students and young professionals.
Therefore, as a measure of passenger volume, we instead used the average number of bus/metro rides per day through a given stop, as detailed in section EDA, subsections Bus GTFS and Metro GTFS. This should be a proxy for volume, as the transportation company (STCP) likely assigns vehicle frequency proportionally to the number of passengers on a given route. However, this proxy fails if there are inefficiencies in the system - e.g., stops where the few passing buses are packed and stops where the frequent buses run half-empty.
In our view, our scooters should be integrated with the current transportation network, but closely linked to the amenities that young people are likely to use. Therefore, we extracted amenity data from OpenStreetMap and used it to compute an amenity importance score. The goal is that sites with a larger score are assigned a larger number of e-scooter parks.
Amenity score
We select a subset of amenities likely to be heavily used by the target population - universities, schools, fast-food restaurants, ... - and assign an importance weight to each. We then compute the distance in km from the point of interest (POI) to the amenity and generate a score using:
$Score_{Amenity}^{POI}(Weight, Distance) = Weight \times f(Distance)$
For the amenity score of Model 1 - call this the standard amenity score - we have used $f(Distance) = \theta(Threshold - Distance)/(reg + Distance)^2$, where $\theta$ is the Heaviside step function, $reg$ is a regularization term that keeps the score finite at zero distance, and $Threshold$ guarantees that only amenities sufficiently close to a POI are considered.
For the amenity score of Model 2, the points of interest are the stops themselves. We therefore want to penalize amenities that are too close to them, because we want the scooter parks to sit at bus stops that are not themselves within walking distance of an important amenity. To that effect, we use $f(Distance) = \theta(Threshold - Distance)\,Burr(c, k)$, where $Burr(c, k)$ is the Burr distribution density, which (for suitable shape parameters) vanishes at zero distance.
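The two scoring functions can be sketched as follows. This is an illustrative sketch, not the notebook's actual implementation: the `threshold`, `reg`, and Burr shape values are placeholders, and `haversine_km` is a hypothetical stand-in for however the `utils` module measures distances.

```python
import numpy as np
from scipy import stats


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (hypothetical helper)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def f_standard(distance, threshold=1.0, reg=0.1):
    """Model 1: theta(threshold - distance) / (reg + distance)**2."""
    return np.where(distance < threshold, 1.0 / (reg + distance) ** 2, 0.0)


def f_burr(distance, threshold=1.0, c=2.0, k=1.0):
    """Model 2: theta(threshold - distance) * Burr pdf.

    SciPy calls the two Burr shape parameters (c, d); the pdf is small
    at very short distances, penalizing amenities right next to a stop.
    """
    return np.where(distance < threshold, stats.burr.pdf(distance, c, k), 0.0)


def amenity_score(weight, distance, f):
    """Score_Amenity^POI = Weight * f(Distance)."""
    return weight * f(distance)
```

A nearby amenity then contributes `amenity_score(w, haversine_km(...), f_standard)` to a POI's total in Model 1, and the `f_burr` variant in Model 2.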
Modeling
We propose two preliminary models:
Model 1 - Use weighted K-Means in a user-defined grid:
Model 2 - Use weighted K-Means on the bus/metro stops:
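The weighted K-Means step shared by both models can be sketched with scikit-learn's `sample_weight` argument. Everything below is illustrative: the uniform grid, the random weights (a stand-in for the combined amenity/ride-volume score), and the number of parks are placeholders, not the values used in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative candidate grid over Porto (Model 1 uses a user-defined grid,
# Model 2 uses the bus/metro stop coordinates instead).
rng = np.random.default_rng(0)
lats = np.linspace(41.14, 41.19, 20)
lons = np.linspace(-8.68, -8.52, 20)
grid = np.array([(lat, lon) for lat in lats for lon in lons])
weights = rng.random(len(grid))  # stand-in for the importance score

n_parks = 15
km = KMeans(n_clusters=n_parks, n_init=10, random_state=0)
# Centroids are pulled toward high-weight points. Note that Euclidean
# distance on raw lat/lon is a crude approximation at city scale.
km.fit(grid, sample_weight=weights)
park_locations = km.cluster_centers_
```

Each cluster center is then a proposed drop-off point location.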
Results
To evaluate the models against the current solution, we computed the standard amenity score for the current e-scooter parks and for those proposed by Models 1 and 2. If our models are competitive, their overall amenity scores should be larger than that of the current parks, meaning the proposed parks are better located.
Unfortunately, this is not the case: both of our models underperform (Model 2, which uses the bus stops as a reference, does slightly better). This is partially because the grid points and the bus stops cover a part of Porto (the southeast) where there are very few mapped amenities and no current scooter parks.
Start coding here! 🐱🏍
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests, zipfile
from io import BytesIO
from sklearn.cluster import KMeans
import json
import geopandas as geopd
import contextily as cx
import shapely as sh
from shapely.geometry import Point, LineString
from scipy import stats
from utils_bin_2 import utils ##IMPORTING HELPER FUNCTIONS
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
url = "https://wdl-data.fra1.digitaloceanspaces.com/porto/Soft-Mobility.zip"
filename = url.split('/')[-1]
# Sending request
req = requests.get(url)
req.raise_for_status()  # fail early on a bad HTTP response
print('Downloading Completed')
# Extracting the files to a folder "NewFolder"
zipfile_x= zipfile.ZipFile(BytesIO(req.content))
zipfile_x.extractall('/NewFolder')
Downloading Completed
Extracting the GTFS data
gtfs_bus_zipfile = zipfile.ZipFile('/NewFolder/Soft-Mobility/GTFS/gtfs_bus.zip')
gtfs_bus_zipfile.extractall('/NewFolder/Soft-Mobility/GTFS/GTFS_Bus')
gtfs_metro_zipfile = zipfile.ZipFile('/NewFolder/Soft-Mobility/GTFS/gtfs_metro.zip')
gtfs_metro_zipfile.extractall('/NewFolder/Soft-Mobility/GTFS/GTFS_Metro')
Extracting population data for the city of Porto
url_2 = "http://mapas.ine.pt/download/filesGPG/2021/municipios/BGRI2021_1312.zip"
filename = url_2.split('/')[-1]
# Sending request
req = requests.get(url_2)
req.raise_for_status()  # fail early on a bad HTTP response
print('Downloading Completed')
# Extracting the files to a folder "NewFolderPop"
zipfile_pop= zipfile.ZipFile(BytesIO(req.content))
zipfile_pop.extractall('/NewFolderPop')
Downloading Completed
Extracting amenity data from OpenStreetMap using the Overpass API
## Defining the bounding box as (south, west, north, east)
new_area = (41.14, -8.68, 41.19, -8.52)
overpass_url = "http://overpass-api.de/api/interpreter"
porto_overpass_query = f"""[out:json];
area[name="Porto"]->.search;
(node[amenity="hospital"]{new_area};
way[amenity="hospital"]{new_area};
rel[amenity="hospital"]{new_area};
node[amenity="school"]{new_area};
way[amenity="school"]{new_area};
rel[amenity="school"]{new_area};
node[amenity="college"]{new_area};
way[amenity="college"]{new_area};
rel[amenity="college"]{new_area};
node[amenity="university"]{new_area};
way[amenity="university"]{new_area};
rel[amenity="university"]{new_area};
node[amenity="library"]{new_area};
way[amenity="library"]{new_area};
rel[amenity="library"]{new_area};
node[amenity="fast_food"]{new_area};
way[amenity="fast_food"]{new_area};
rel[amenity="fast_food"]{new_area};
node[amenity="cinema"]{new_area};
way[amenity="cinema"]{new_area};
rel[amenity="cinema"]{new_area};
node[amenity="theatre"]{new_area};
way[amenity="theatre"]{new_area};
rel[amenity="theatre"]{new_area};
);
out geom;
"""
porto_response = requests.get(overpass_url, params = {"data":porto_overpass_query})
p_resp_json = porto_response.json()
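The repeated node/way/rel clauses in the query above could also be generated programmatically; a small sketch that builds an equivalent query string from the same amenity list (the variable names here are illustrative):

```python
# Build the Overpass union block from a list of amenity tags instead of
# writing 24 clauses by hand.
amenity_tags = ["hospital", "school", "college", "university",
                "library", "fast_food", "cinema", "theatre"]
bbox = (41.14, -8.68, 41.19, -8.52)  # (south, west, north, east)

clauses = "\n".join(
    f'{elem}[amenity="{tag}"]{bbox};'
    for tag in amenity_tags
    for elem in ("node", "way", "rel")
)
query = f"[out:json];\n({clauses}\n);\nout geom;"
```

The resulting `query` can be sent to the Overpass endpoint exactly like `porto_overpass_query` above.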
node_list = [ele for ele in p_resp_json["elements"] if ele["type"] == "node"]
## Handling node data
node_list_mod = []
for node in node_list:
    node_dict_mod = {}
    for key, value in node.items():
        if key == "tags":
            node_dict_mod[key] = node[key]["amenity"]
            if "name" in node[key]:
                node_dict_mod["name"] = node[key]["name"]
        else:
            node_dict_mod[key] = value
    node_list_mod.append(node_dict_mod)
node_data_df = pd.DataFrame(node_list_mod)
node_data_df.head()
| type | id | lat | lon | tags | name | |
|---|---|---|---|---|---|---|
| 0 | node | 789716384 | 41.173877 | -8.566150 | cinema | Parque Nascente |
| 1 | node | 789717503 | 41.174258 | -8.566304 | fast_food | Parque Nascente |
| 2 | node | 1278714045 | 41.146628 | -8.610853 | fast_food | McDonald's Imperial |
| 3 | node | 1353885310 | 41.180177 | -8.654981 | fast_food | Praça da Alimentação |
| 4 | node | 1628786728 | 41.177450 | -8.615232 | school | Escola Preparatória e Secundária Pero Vaz de C... |
node_data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   type    84 non-null     object
 1   id      84 non-null     int64
 2   lat     84 non-null     float64
 3   lon     84 non-null     float64
 4   tags    84 non-null     object
 5   name    81 non-null     object
dtypes: float64(2), int64(1), object(3)
memory usage: 4.1+ KB
way_list = [ele for ele in p_resp_json["elements"] if ele["type"] == "way"]
## Handling way data
way_list_mod = []
for way in way_list:
    way_dict_mod = {}
    for key, value in way.items():
        if key == "bounds":
            # Use a bounding-box corner as the way's representative point
            way_dict_mod["lat"] = way[key]["minlat"]
            way_dict_mod["lon"] = way[key]["minlon"]
        elif key in ("nodes", "geometry"):
            pass
        elif key == "tags":
            way_dict_mod[key] = way[key]["amenity"]
            if "name" in way[key]:
                way_dict_mod["name"] = way[key]["name"]
        else:
            way_dict_mod[key] = value
    way_list_mod.append(way_dict_mod)
way_data_df = pd.DataFrame(way_list_mod)
way_data_df.head()
| type | id | lat | lon | tags | name | |
|---|---|---|---|---|---|---|
| 0 | way | 25689164 | 41.180441 | -8.665016 | hospital | Hospital Pedro Hispano - Unidade Local de Saúd... |
| 1 | way | 26120268 | 41.146137 | -8.620292 | hospital | Hospital Geral de Santo António |
| 2 | way | 30405435 | 41.155689 | -8.626596 | hospital | Hospital Lusíadas Porto |
| 3 | way | 35540217 | 41.159669 | -8.600185 | school | Escola Secundária Aurélia de Sousa |
| 4 | way | 35565169 | 41.167124 | -8.601268 | school | Escola Secundária António Nobre |
way_data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   type    273 non-null    object
 1   id      273 non-null    int64
 2   lat     273 non-null    float64
 3   lon     273 non-null    float64
 4   tags    273 non-null    object
 5   name    259 non-null    object
dtypes: float64(2), int64(1), object(3)
memory usage: 12.9+ KB
Joining the node and way data to form the amenities DataFrame.
amenities_df = pd.concat((node_data_df, way_data_df), ignore_index = True)
amenities_df["tags"].unique()
array(['cinema', 'fast_food', 'school', 'library', 'college',
'university', 'theatre', 'hospital'], dtype=object)
south, west, north, east = (41.13, -8.68, 41.19, -8.52)
porto_img, porto_ext = cx.bounds2img(west,
south,
east,
north,
ll=True,
source=cx.providers.OpenStreetMap.Mapnik
)
amenities_mercator_df = utils.mercator_transform(amenities_df)
amenities_map_dict = {amenity: i for i, amenity in enumerate(amenities_df["tags"].unique())}
amenities_mercator_df["encoded_tags"] = amenities_df["tags"].map(amenities_map_dict)
amen_cmap = plt.get_cmap('Dark2', len(amenities_map_dict))
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
amen_grouped = amenities_mercator_df.groupby('encoded_tags')
for amen_key, amen_group in amen_grouped:
amen_group.plot(color = amen_cmap((amen_key)), ax = ax, markersize = 100)
plt.legend(list(amenities_map_dict.keys()))
plt.title("Different amenities")
fig.savefig("Outputs/Amenities_Porto.png", dpi=200)
plt.show()
We have to remove primary schools from the dataset, as the demographic that attends these schools is not old enough to ride the scooters. One could argue that we should remove the basic schools as well.
## Rows whose name contains "Prim" (primary schools); na=False skips unnamed amenities
prim_schools_df = amenities_df[amenities_df["name"].str.contains("Prim", na=False)]
amenities_red_df = amenities_df.drop(prim_schools_df.index).reset_index(drop = True)
Key conclusions: we determined the ride volume per stop and plotted it on the map of Porto. This allows us to see which areas of Porto are currently well served by public transport and which are not.
gtfs_bus_folder_path = '/NewFolder/Soft-Mobility/GTFS/GTFS_Bus'
pd.read_csv(gtfs_bus_folder_path + "/agency.txt").T ##Transport agency
| 0 | |
|---|---|
| agency_id | STCP |
| agency_name | Sociedade Transportes Colectivos do Porto |
| agency_url | http://www.stcp.pt |
| agency_timezone | Europe/Lisbon |
| agency_lang | pt |
bus_stop_times_df = pd.read_csv(gtfs_bus_folder_path + "/stop_times.txt")
## We can use this data to generate a density profile of ride times, aggregated by stop_id
bus_stop_times_df.head()
| trip_id | arrival_time | departure_time | stop_id | stop_sequence | stop_headsign | |
|---|---|---|---|---|---|---|
| 0 | 107_0_D_6 | 8:47:47 | 8:47:47 | BFAL1 | 7 | NaN |
| 1 | 107_0_D_6 | 8:48:27 | 8:48:27 | MCAM3 | 8 | NaN |
| 2 | 107_0_D_6 | 8:49:12 | 8:49:12 | ICAM1 | 9 | NaN |
| 3 | 107_0_D_6 | 8:50:53 | 8:50:53 | QBNJ2 | 10 | NaN |
| 4 | 107_0_D_6 | 8:51:42 | 8:51:42 | TBNJ2 | 11 | NaN |
This is the crucial piece of information for determining how many buses pass through a given stop per day.
bus_stop_times_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 831438 entries, 0 to 831437
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   trip_id         831438 non-null  object
 1   arrival_time    831438 non-null  object
 2   departure_time  831438 non-null  object
 3   stop_id         831438 non-null  object
 4   stop_sequence   831438 non-null  int64
 5   stop_headsign   0 non-null       float64
dtypes: float64(1), int64(1), object(4)
memory usage: 38.1+ MB
bus_stop_times_df["arrival_time"] = pd.to_datetime(bus_stop_times_df["arrival_time"].apply(utils.time_shifter))#.dt.time
## Changing the arrival time to the Datetime format
stop_t_unique_dic = {col: bus_stop_times_df[col].unique() for col in bus_stop_times_df.columns}
{ele: len(stop_t_unique_dic[ele]) for ele in stop_t_unique_dic.keys()}
{'trip_id': 22977,
'arrival_time': 67065,
'departure_time': 67291,
'stop_id': 2483,
'stop_sequence': 63,
'stop_headsign': 1}
Because the scooter locations are restricted to the Porto area, we shall focus on the Porto bus stops.
bus_stops_df = pd.read_csv(gtfs_bus_folder_path + "/stops.txt")
bus_stops_df.head()
| stop_id | stop_code | stop_name | stop_lat | stop_lon | zone_id | stop_url | |
|---|---|---|---|---|---|---|---|
| 0 | VFRR1 | VFRR1 | VALE FERREIROS | 41.181083 | -8.535056 | GDM1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... |
| 1 | ECC2 | ECC2 | ENTRE CANCELAS | 41.181833 | -8.532444 | GDM1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... |
| 2 | MTP2 | MTP2 | MONTE PEDRO | 41.189528 | -8.524389 | GDM1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... |
| 3 | ASRR4 | ASRR4 | ALTO DA SERRA | 41.193022 | -8.518383 | GDM1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... |
| 4 | VFRR2 | VFRR2 | VALE FERREIROS | 41.181167 | -8.535056 | GDM1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... |
bus_stops_df["zone_id"].unique()
array(['GDM1', 'VLG3', 'PRT3', 'GDM2', 'MTS1', 'MTS2', 'MAI1', 'PRT2',
'PRT1', 'MAI4', 'VLG1', 'VCD8', 'VNG1', 'MAI3', 'MAI2', 'VNG4',
'VNG2', 'VLG2', 'VNG5'], dtype=object)
bus_stops_porto_df = bus_stops_df[bus_stops_df["zone_id"].str.contains("PRT")] ##Stops in the Porto area
bus_stop_positions_df_merc = utils.mercator_transform(bus_stops_porto_df)
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
bus_stop_positions_df_merc.plot(ax = ax)
plt.title("Bus stops in Porto")
plt.show()
fig.savefig("Outputs/BusStopsPorto.png", dpi=200)
Using stop times and stops
bus_stop_times_df["trip_id_mod"] = bus_stop_times_df["trip_id"].str.split("_").apply(lambda x: "_".join(x[:-1]))
bus_stop_times_df.head()
print(bus_stop_times_df.shape)
(831438, 7)
We define the average number of bus/metro rides using the rides occurring on working days:
bus_stop_times_ut_df = bus_stop_times_df[bus_stop_times_df["trip_id"].str.contains("U")]
print(bus_stop_times_ut_df.shape)
(198406, 7)
First, we merge the stop_times_df and the relevant part of stops_df.
bus_stop_times_ut_df_aug = bus_stop_times_ut_df.merge(bus_stops_df[["stop_id","zone_id"]], on = "stop_id")
print(bus_stop_times_ut_df_aug.shape)
bus_stop_times_ut_df_aug.head()
(198406, 8)
| trip_id | arrival_time | departure_time | stop_id | stop_sequence | stop_headsign | trip_id_mod | zone_id | |
|---|---|---|---|---|---|---|---|---|
| 0 | 106_1_U_1 | 2022-05-07 07:00:00 | 7:00:00 | FRC | 1 | NaN | 106_1_U | VNG4 |
| 1 | 106_0_U_2 | 2022-05-07 07:28:00 | 7:28:00 | FRC | 15 | NaN | 106_0_U | VNG4 |
| 2 | 106_1_U_3 | 2022-05-07 07:30:00 | 7:30:00 | FRC | 1 | NaN | 106_1_U | VNG4 |
| 3 | 106_0_U_4 | 2022-05-07 07:58:00 | 7:58:00 | FRC | 15 | NaN | 106_0_U | VNG4 |
| 4 | 106_1_U_5 | 2022-05-07 08:00:00 | 8:00:00 | FRC | 1 | NaN | 106_1_U | VNG4 |
bus_stop_times_porto_df = bus_stop_times_ut_df_aug[bus_stop_times_ut_df_aug["zone_id"].str.contains("PRT")]
print(bus_stop_times_porto_df.shape)
(107131, 8)
bus_stop_times_porto_df.head()
| trip_id | arrival_time | departure_time | stop_id | stop_sequence | stop_headsign | trip_id_mod | zone_id | |
|---|---|---|---|---|---|---|---|---|
| 898 | 107_1_U_1 | 2022-05-07 06:40:00 | 6:40:00 | AREI2 | 1 | NaN | 107_1_U | PRT3 |
| 899 | 107_1_U_3 | 2022-05-07 07:30:00 | 7:30:00 | AREI2 | 1 | NaN | 107_1_U | PRT3 |
| 900 | 107_1_U_5 | 2022-05-07 08:20:00 | 8:20:00 | AREI2 | 1 | NaN | 107_1_U | PRT3 |
| 901 | 107_1_U_7 | 2022-05-07 09:10:00 | 9:10:00 | AREI2 | 1 | NaN | 107_1_U | PRT3 |
| 902 | 107_1_U_9 | 2022-05-07 10:00:00 | 10:00:00 | AREI2 | 1 | NaN | 107_1_U | PRT3 |
We define an hourly sampled time series - from 00:00 to 23:00 - of the buses that travel through a given stop, for all stops:
bus_full_stop_dict, bus_stop_average_dict = utils.generating_stop_data(bus_stop_times_porto_df)
bus_avg_num_rides_stops_df = pd.DataFrame(bus_stop_average_dict, index = ["Average Num Rides p Hour"]).T
bus_avg_num_rides_stops_df.shape
(947, 1)
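`utils.generating_stop_data` is a project helper not shown in this notebook; the sketch below is a hypothetical minimal equivalent of what we assume it computes - scheduled arrivals grouped by stop and hour, with the daily total then spread over the 24 hours of the day. The tiny `demo` frame is made-up data for illustration.

```python
import pandas as pd

# Made-up sample: three arrivals at stop A, one at stop B
demo = pd.DataFrame({
    "stop_id": ["A", "A", "A", "B"],
    "arrival_time": pd.to_datetime([
        "2022-05-07 08:10", "2022-05-07 08:40",
        "2022-05-07 09:05", "2022-05-07 08:20",
    ]),
})

# Count arrivals per (stop, hour of day)
hourly = (demo.assign(hour=demo["arrival_time"].dt.hour)
              .groupby(["stop_id", "hour"]).size()
              .unstack(fill_value=0))

# Average number of rides per hour = daily total / 24
avg_rides_per_hour = hourly.sum(axis=1) / 24
```

This reproduces the shape of `bus_stop_average_dict`: one average-rides-per-hour figure per stop.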
bus_avg_num_rides_stops_df.describe()
| Average Num Rides p Hour | |
|---|---|
| count | 947.000000 |
| mean | 4.713613 |
| std | 3.509947 |
| min | 0.208333 |
| 25% | 2.041667 |
| 50% | 4.083333 |
| 75% | 6.333333 |
| max | 27.958333 |
On average, each stop has about four buses passing by per hour. However, there are much busier stops - Carmo tops the list with almost 28 buses per hour.
## Let us figure out what are the busiest stops
busiest_bus_stops_df = bus_stops_df.join(bus_avg_num_rides_stops_df, how = "right", on = "stop_code")\
.sort_values(by = "Average Num Rides p Hour", ascending= False)
n_largest_stops = 10
busiest_bus_stops_df.head(n_largest_stops)
| stop_id | stop_code | stop_name | stop_lat | stop_lon | zone_id | stop_url | Average Num Rides p Hour | |
|---|---|---|---|---|---|---|---|---|
| 347 | CMO | CMO | CARMO | 41.147228 | -8.616959 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 27.958333 |
| 2251 | PRFL | PRFL | PR.FILIPA DE LENCASTRE | 41.148305 | -8.612763 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 20.291667 |
| 723 | PRG1 | PRG1 | PR. DA GALIZA | 41.153444 | -8.626333 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 19.916667 |
| 732 | BS1 | BS1 | BOAVISTA - B.SUCESSO | 41.156403 | -8.627930 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 19.916667 |
| 199 | BCM1 | BCM1 | BOAVISTA-CASA DA MÚSICA | 41.158903 | -8.629542 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 19.750000 |
| 668 | DJOA4 | DJOA4 | D. JOÃO IV | 41.148972 | -8.601694 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 19.583333 |
| 350 | PRDJ | PRDJ | PR. D. JOÃO I | 41.147581 | -8.609126 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 19.416667 |
| 148 | HSA5 | HSA5 | HOSP. ST. ANTÓNIO | 41.147766 | -8.622450 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 17.833333 |
| 1428 | IPO4 | IPO4 | IPO (CIRCUNVAL.) | 41.183915 | -8.604387 | PRT3 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 17.291667 |
| 1294 | GRC1 | GRC1 | GRACIOSA | 41.162921 | -8.627502 | PRT1 | http://www.stcp.pt/pt/viajar/paragens/?t=detal... | 17.208333 |
We now know the $n_{busiest}$ stops (in terms of average number of buses per hour). On the map, they look like this:
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
bus_stop_positions_df_merc.plot(ax = ax, alpha = 0.5)
bus_stop_positions_df_merc.loc[busiest_bus_stops_df.index[:n_largest_stops],:].plot(ax = ax,marker = "^", markersize = 250)
plt.title("BUSiest stops in Porto")
fig.savefig("Outputs/BusiestStopsPorto.png", dpi=200)
plt.show()
Some examples are:
for stop in busiest_bus_stops_df["stop_id"][:3]:
bus_full_stop_dict[stop].plot()
plt.title(f"Stop id is: {stop}")
plt.show()
Using quantiles is quite insightful:
q_list = [0.,0.05,0.5,0.75,0.95,1.]
quantile_list = [busiest_bus_stops_df["Average Num Rides p Hour"].quantile(ele) for ele in q_list]
for quantile in q_list:
print(f"The average number of rides p hour for the {quantile} percentile is")
print(busiest_bus_stops_df["Average Num Rides p Hour"].quantile(quantile))
The average number of rides p hour for the 0.0 percentile is
0.20833333333333334
The average number of rides p hour for the 0.05 percentile is
1.0416666666666667
The average number of rides p hour for the 0.5 percentile is
4.083333333333333
The average number of rides p hour for the 0.75 percentile is
6.333333333333333
The average number of rides p hour for the 0.95 percentile is
11.445833333333331
The average number of rides p hour for the 1.0 percentile is
27.958333333333332
quantile_int_labels = [ele for ele in range(len(quantile_list)-1,0,-1)]
busiest_bus_stops_df["Quantile Interval"] = pd.cut(busiest_bus_stops_df["Average Num Rides p Hour"], quantile_list, include_lowest=True, \
labels = quantile_int_labels)
num_rides_perc_dict = {}
for quant_ind in quantile_int_labels:
num_rides_perc_dict[quant_ind] = busiest_bus_stops_df[busiest_bus_stops_df["Quantile Interval"] == quant_ind].dropna()
color_list = ["c","b","g","y","r"]
marker_list = ["^","o","<","*",">"]
marker_size_list = [50,50,50,50,250]
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
for ind, ele in enumerate(quantile_int_labels):
bus_stop_positions_df_merc.loc[num_rides_perc_dict[ele].index,:].plot(ax = ax,c = color_list[ind], marker = marker_list[ind], \
markersize = marker_size_list[ind])
plt.title("Stops in Porto - from busiest to quietest using daily average num rides")
txt_string = "Below 5th, 5th to 50th, 50th to 75th, 75th to 95th, Above 95th percentile"
plt.legend(txt_string.split(","))
fig.savefig("Outputs/BusiestStopsPorto.png", dpi = 200)
plt.show()
As expected, the center of Porto has the largest average number of rides - partially due to the fact that more lines cross the center of Porto. Key arteries of Porto like Avenida da Boavista also have a large number of rides (as represented by the yellow stars).
Key conclusions: we determined the ride volume per stop and plotted it on the map of Porto. This allows us to see which areas of Porto are currently well served by public transport and which are not. Unfortunately, this dataset is missing ride data for the yellow line (line D). See the map below.
gtfs_metro_folder_path = '/NewFolder/Soft-Mobility/GTFS/GTFS_Metro/google_transit_dez2021'
pd.read_csv(gtfs_metro_folder_path + "/agency.txt").T ##Transport agency
| 0 | |
|---|---|
| agency_id | NaN |
| agency_name | Metro do Porto |
| agency_url | http://www.metrodoporto.pt |
| agency_timezone | Europe/Lisbon |
| agency_lang | pt |
stop_times_df = pd.read_csv(gtfs_metro_folder_path + "/stop_times.txt")
stop_times_df.head()
| trip_id | arrival_time | departure_time | stop_id | stop_sequence | stop_headsign | pickup_type | drop_off_type | shape_dist_traveled | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ADF0 | 6:03:40 | 6:04:00 | 1 | 1 | NaN | NaN | NaN | NaN |
| 1 | ADF0 | 6:05:40 | 6:06:00 | 2 | 2 | NaN | NaN | NaN | NaN |
| 2 | ADF0 | 6:07:40 | 6:08:00 | 3 | 3 | NaN | NaN | NaN | NaN |
| 3 | ADF0 | 6:09:40 | 6:10:00 | 4 | 4 | NaN | NaN | NaN | NaN |
| 4 | ADF0 | 6:10:40 | 6:11:00 | 5 | 5 | NaN | NaN | NaN | NaN |
This is the crucial piece of information for determining how many metro trains pass through a given stop per day.
stop_times_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27070 entries, 0 to 27069
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   trip_id              27070 non-null  object
 1   arrival_time         27070 non-null  object
 2   departure_time       27070 non-null  object
 3   stop_id              27070 non-null  int64
 4   stop_sequence        27070 non-null  int64
 5   stop_headsign        0 non-null      float64
 6   pickup_type          0 non-null      float64
 7   drop_off_type        0 non-null      float64
 8   shape_dist_traveled  0 non-null      float64
dtypes: float64(4), int64(2), object(3)
memory usage: 1.9+ MB
stop_times_df["arrival_time"] = pd.to_datetime(stop_times_df["arrival_time"].apply(utils.time_shifter))
stop_t_unique_dic = {col: stop_times_df[col].unique() for col in stop_times_df.columns}
{ele: len(stop_t_unique_dic[ele]) for ele in stop_t_unique_dic.keys()}
{'trip_id': 1117,
'arrival_time': 1187,
'departure_time': 1187,
'stop_id': 83,
'stop_sequence': 36,
'stop_headsign': 1,
'pickup_type': 1,
'drop_off_type': 1,
'shape_dist_traveled': 1}
Because the scooter locations are restricted to the Porto area, we shall focus on the Porto metro stops.
stops_df = pd.read_csv(gtfs_metro_folder_path + "/stops.txt")
stops_df.head()
| stop_id | stop_code | stop_name | stop_desc | stop_lat | stop_lon | zone_id | stop_url | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | Estadio do Dragao | NaN | 41.160716 | -8.582416 | PRT1 | NaN |
| 1 | 2 | NaN | Campanha | NaN | 41.150537 | -8.586245 | PRT1 | NaN |
| 2 | 3 | NaN | Heroismo | NaN | 41.146697 | -8.592978 | PRT1 | NaN |
| 3 | 4 | NaN | 24 de Agosto | NaN | 41.148796 | -8.598349 | PRT1 | NaN |
| 4 | 5 | NaN | Bolhao | NaN | 41.149785 | -8.605901 | PRT1 | NaN |
stops_df["zone_id"].unique()
array(['PRT1', 'PRT2', 'MTS1', 'MAI1', 'VCD8', 'VCD3', 'PV_VC', 'MAI2',
'VNG1', 'PRT3', 'MAI4', 'GDM1'], dtype=object)
stops_porto_df = stops_df[stops_df["zone_id"].str.contains("PRT")] ##Stops in the Porto area
stop_positions_df_merc = utils.mercator_transform(stops_porto_df)
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
stop_positions_df_merc.plot(ax = ax)
plt.title("Metro stops in Porto")
fig.savefig("Outputs/MetroStopsPorto.png", dpi = 200)
plt.show()
Using stop times and stops.
stop_times_df.head()
print(stop_times_df.shape)
(27070, 9)
Selecting working-day rides only:
stop_times_ut_df = stop_times_df[stop_times_df["trip_id"].str.contains("U")]
print(stop_times_ut_df.shape)
(11750, 9)
We want to apply this to all the stops in the Porto area. First, we merge the stop_times_df and the relevant part of stops_df.
stop_times_ut_df_aug = stop_times_ut_df.merge(stops_df[["stop_id","zone_id"]], on = "stop_id")
print(stop_times_ut_df_aug.shape)
stop_times_ut_df_aug.head()
(11750, 10)
| trip_id | arrival_time | departure_time | stop_id | stop_sequence | stop_headsign | pickup_type | drop_off_type | shape_dist_traveled | zone_id | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AU0 | 2022-05-07 06:03:40 | 6:04:00 | 1 | 1 | NaN | NaN | NaN | NaN | PRT1 |
| 1 | AU1 | 2022-05-07 06:41:40 | 6:42:00 | 1 | 23 | NaN | NaN | NaN | NaN | PRT1 |
| 2 | BU0 | 2022-05-07 06:00:40 | 6:01:00 | 1 | 1 | NaN | NaN | NaN | NaN | PRT1 |
| 3 | BU2 | 2022-05-07 06:29:40 | 6:30:00 | 1 | 1 | NaN | NaN | NaN | NaN | PRT1 |
| 4 | BU4 | 2022-05-07 06:59:40 | 7:00:00 | 1 | 1 | NaN | NaN | NaN | NaN | PRT1 |
stop_times_porto_df = stop_times_ut_df_aug[stop_times_ut_df_aug["zone_id"].str.contains("PRT")].copy()
stop_times_porto_df["stop_id"] = stop_times_porto_df["stop_id"].astype("str")
print(stop_times_porto_df.shape)
(6817, 10)
stop_times_porto_df.head()
| trip_id | arrival_time | departure_time | stop_id | stop_sequence | stop_headsign | pickup_type | drop_off_type | shape_dist_traveled | zone_id | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AU0 | 2022-05-07 06:03:40 | 6:04:00 | 1 | 1 | NaN | NaN | NaN | NaN | PRT1 |
| 1 | AU1 | 2022-05-07 06:41:40 | 6:42:00 | 1 | 23 | NaN | NaN | NaN | NaN | PRT1 |
| 2 | BU0 | 2022-05-07 06:00:40 | 6:01:00 | 1 | 1 | NaN | NaN | NaN | NaN | PRT1 |
| 3 | BU2 | 2022-05-07 06:29:40 | 6:30:00 | 1 | 1 | NaN | NaN | NaN | NaN | PRT1 |
| 4 | BU4 | 2022-05-07 06:59:40 | 7:00:00 | 1 | 1 | NaN | NaN | NaN | NaN | PRT1 |
full_stop_dict, stop_average_dict = utils.generating_stop_data(stop_times_porto_df, trip_id = "trip_id")
avg_num_rides_stops_df = pd.DataFrame(stop_average_dict, index = ["Average Num Rides p Hour"]).T
avg_num_rides_stops_df.shape
(25, 1)
avg_num_rides_stops_df.describe()
| Average Num Rides p Hour | |
|---|---|
| count | 25.000000 |
| mean | 11.361667 |
| std | 9.084231 |
| min | 0.083333 |
| 25% | 0.166667 |
| 50% | 17.291667 |
| 75% | 20.458333 |
| max | 20.625000 |
Determining the busiest stops:
metroest_stops_df = stops_df.join(avg_num_rides_stops_df, how = "right", on = "stop_id")\
.sort_values(by = "Average Num Rides p Hour", ascending= False)
n_largest_stops = 10
metroest_stops_df.head(n_largest_stops)
| stop_id | stop_code | stop_name | stop_desc | stop_lat | stop_lon | zone_id | stop_url | Average Num Rides p Hour | |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | NaN | Trindade | NaN | 41.152277 | -8.609299 | PRT1 | NaN | 20.625000 |
| 12 | 13 | NaN | Sete Bicas | NaN | 41.182422 | -8.652149 | PRT2 | NaN | 20.458333 |
| 13 | 14 | NaN | Senhora da Hora | NaN | 41.188101 | -8.654466 | PRT2 | NaN | 20.458333 |
| 6 | 7 | NaN | Lapa | NaN | 41.157107 | -8.616723 | PRT1 | NaN | 20.458333 |
| 7 | 8 | NaN | Carolina Michaelis | NaN | 41.158585 | -8.622246 | PRT1 | NaN | 20.458333 |
| 8 | 9 | NaN | Casa da Musica | NaN | 41.160575 | -8.628282 | PRT1 | NaN | 20.458333 |
| 9 | 10 | NaN | Francos | NaN | 41.165549 | -8.636347 | PRT1 | NaN | 20.458333 |
| 10 | 11 | NaN | Ramalde | NaN | 41.172962 | -8.641766 | PRT2 | NaN | 20.458333 |
| 11 | 12 | NaN | Via Rapida Viso | NaN | 41.177255 | -8.646498 | PRT2 | NaN | 20.458333 |
| 2 | 3 | NaN | Heroismo | NaN | 41.146697 | -8.592978 | PRT1 | NaN | 17.291667 |
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
stop_positions_df_merc.plot(ax = ax, alpha = 0.5)
stop_positions_df_merc.loc[metroest_stops_df.index[:n_largest_stops],:].plot(ax = ax,marker = "^", markersize = 250)
plt.title("Metroest stops in Porto")
plt.show()
More details on the missing data hinted above:
print(metroest_stops_df.loc[66])
stop_id                            67
stop_code                         NaN
stop_name                      I.P.O.
stop_desc                         NaN
stop_lat                    41.181253
stop_lon                    -8.604521
zone_id                          PRT3
stop_url                          NaN
Average Num Rides p Hour     0.083333
Name: 66, dtype: object
print(full_stop_dict[66])
          stop_id  DU0  DU1  DUPU0  DUPU1
00:00:00       66  NaN  NaN    NaN    NaN
01:00:00       66  NaN  NaN    NaN    NaN
02:00:00       66  NaN  NaN    NaN    NaN
03:00:00       66  NaN  NaN    NaN    NaN
04:00:00       66  NaN  NaN    NaN    NaN
05:00:00       66  NaN  NaN    NaN    NaN
06:00:00       66  1.0  1.0    1.0    NaN
07:00:00       66  NaN  NaN    NaN    1.0
08:00:00       66  NaN  NaN    NaN    NaN
09:00:00       66  NaN  NaN    NaN    NaN
10:00:00       66  NaN  NaN    NaN    NaN
11:00:00       66  NaN  NaN    NaN    NaN
12:00:00       66  NaN  NaN    NaN    NaN
13:00:00       66  NaN  NaN    NaN    NaN
14:00:00       66  NaN  NaN    NaN    NaN
15:00:00       66  NaN  NaN    NaN    NaN
16:00:00       66  NaN  NaN    NaN    NaN
17:00:00       66  NaN  NaN    NaN    NaN
18:00:00       66  NaN  NaN    NaN    NaN
19:00:00       66  NaN  NaN    NaN    NaN
20:00:00       66  NaN  NaN    NaN    NaN
21:00:00       66  NaN  NaN    NaN    NaN
22:00:00       66  NaN  NaN    NaN    NaN
23:00:00       66  NaN  NaN    NaN    NaN
IPO clearly does not serve only a handful of trips per day; ride data for this stop is missing.
Key Conclusions:
e_scooter_od_path = r"/NewFolder/Soft-Mobility/E-Scooter OD/wdl_od_view.csv"
e_scooter_df = pd.read_csv(e_scooter_od_path)
e_scooter_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489279 entries, 0 to 489278
Data columns (total 11 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   recorded             489279 non-null  int64
 1   id                   489279 non-null  int64
 2   device_id            489279 non-null  object
 3   timestamp            489279 non-null  int64
 4   event_types          489279 non-null  object
 5   vehicle_state        489279 non-null  object
 6   trip_state           0 non-null       float64
 7   telemetry_timestamp  489279 non-null  int64
 8   trip_id              179449 non-null  object
 9   lat                  489279 non-null  float64
 10  lng                  489279 non-null  float64
dtypes: float64(3), int64(4), object(4)
memory usage: 41.1+ MB
for col in ["recorded", "timestamp", "telemetry_timestamp"]:
e_scooter_df[col] = pd.to_datetime(e_scooter_df[col],unit='ms')
e_scooter_df = (e_scooter_df.sort_values(by = "recorded")
                .reset_index(drop = True)
                .rename(columns = {"lng": "lon"}))
scooter_unique_dic = {col: e_scooter_df[col].unique() for col in e_scooter_df.columns}
Check the cardinality (number of unique values) of each column:
{key: len(scooter_unique_dic[key]) for key in scooter_unique_dic.keys()}
{'recorded': 489279,
'id': 489279,
'device_id': 3720,
'timestamp': 451940,
'event_types': 38,
'vehicle_state': 7,
'trip_state': 1,
'telemetry_timestamp': 437966,
'trip_id': 79874,
'lat': 19692,
'lon': 72030}
Split scooters into working and non-working to have an idea of what percentage (from total) of scooters is working:
working_filter = e_scooter_df["vehicle_state"].isin(["available", "on_trip"])
working_e_scooter_df = e_scooter_df[working_filter].copy()
not_working_filter = e_scooter_df["vehicle_state"].isin(
    ["removed", "non_operational", "elsewhere", "unknown"])
not_working_e_scooter_df = e_scooter_df[not_working_filter].copy()
scooter_encoding = {"on_trip": 1, "available": 2}
working_e_scooter_df["route_encoding"] = working_e_scooter_df["vehicle_state"].map(scooter_encoding)
print(f"Working fraction: {round(len(working_e_scooter_df)/len(e_scooter_df),3)}")
print(f"Not Working fraction: {round(len(not_working_e_scooter_df)/len(e_scooter_df),3)}") ## See more below
Working fraction: 0.476
Not Working fraction: 0.521
A scooter is either on a trip or available in only about 48% of the records, which seems low. Perhaps the batteries drain quickly and need frequent recharging. This is important to understand because scooters have inherent limitations as a means of transport: they carry only one rider (occasionally two) and are generally used only when the weather is good and the roads are dry.
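Note that the 48% figure counts records, not time. A rough time-based estimate can be obtained by attributing to each record the interval until the same device's next record; a sketch (the `time_in_state` helper is our own, not part of `utils`):

```python
import pandas as pd

def time_in_state(df):
    """Approximate hours spent in each vehicle_state per device.

    Each record is assumed to hold until the device's next record, so we
    difference consecutive `recorded` timestamps within each device_id.
    The last record of each device has an unknown duration (NaN).
    """
    df = df.sort_values(["device_id", "recorded"]).copy()
    df["duration_h"] = (
        df.groupby("device_id")["recorded"].diff().shift(-1).dt.total_seconds() / 3600
    )
    return df.groupby("vehicle_state")["duration_h"].sum()

# Hypothetical usage on the real data:
# time_in_state(e_scooter_df) / time_in_state(e_scooter_df).sum()
```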
e_s_positions_df_merc = utils.mercator_transform(working_e_scooter_df)
cmap = {1: "b", 2: "r"}
e_s_positions_df_merc["route_encoding"] = working_e_scooter_df["route_encoding"]
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
grouped = e_s_positions_df_merc.groupby('route_encoding')
for key, group in grouped:
group.plot(color = cmap[key], ax = ax, alpha = 0.2)
plt.title("Scooter locations")
fig.savefig("Outputs/Scooter_Locations.png", dpi = 200)
plt.show()
drop_off_df = e_scooter_df[e_scooter_df["event_types"] == "{provider_drop_off}"]
drop_off_positions_df_merc = utils.mercator_transform(drop_off_df)
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
drop_off_positions_df_merc.plot(color = "g", ax = ax)
plt.title("Scooter drop-off locations")
fig.savefig("Outputs/Scooter_Dropoff_Locations.png", dpi = 200)
plt.show()
Let us investigate why the scooters are out of service:
not_working_e_scooter_df["vehicle_state"].value_counts()
elsewhere          96841
removed            90528
unknown            55155
non_operational    12301
Name: vehicle_state, dtype: int64
Elsewhere and removed are the most common states.
not_working_e_scooter_df.groupby(["vehicle_state", "event_types"]).agg({"recorded":"count"})
| recorded | ||
|---|---|---|
| vehicle_state | event_types | |
| elsewhere | {trip_leave_jurisdiction} | 16369 |
| {unspecified,located} | 80472 | |
| non_operational | {battery_low} | 4371 |
| {comms_restored,battery_low} | 1228 | |
| {located,battery_low} | 3164 | |
| {missing,located,battery_low} | 941 | |
| {provider_drop_off,battery_low} | 267 | |
| {trip_cancel,battery_low} | 15 | |
| {trip_end,battery_low} | 2314 | |
| {unspecified} | 1 | |
| removed | {decommissioned} | 142 |
| {maintenance_pick_up} | 89974 | |
| {trip_cancel,decommissioned} | 1 | |
| {trip_end,decommissioned} | 2 | |
| {unspecified} | 409 | |
| unknown | {comms_lost} | 13347 |
| {missing} | 41715 | |
| {reservation_cancel,comms_lost} | 14 | |
| {reservation_cancel,missing} | 6 | |
| {trip_cancel,comms_lost} | 16 | |
| {trip_cancel,missing} | 16 | |
| {trip_end,comms_lost} | 9 | |
| {trip_end,missing} | 31 | |
| {unspecified} | 1 |
life_of_a_scooter_df = e_scooter_df[e_scooter_df["device_id"] == e_scooter_df["device_id"].unique()[0]]
len(e_scooter_df["device_id"].unique())
3720
There are/were 3720 different scooters in the system.
decom_scooter_dic = {}
running_scooter_dic = {}
for scooter_id in e_scooter_df["device_id"].unique():
single_scooter_df = e_scooter_df[e_scooter_df["device_id"] == scooter_id]
if single_scooter_df["event_types"].iloc[-1] == "{decommissioned}":
decom_scooter_dic[scooter_id] = (single_scooter_df.iloc[-1,0] - single_scooter_df.iloc[0,0]).total_seconds()/3600
else:
running_scooter_dic[scooter_id] = (single_scooter_df.iloc[-1,0] - single_scooter_df.iloc[0,0]).total_seconds()/3600
print(round(len(decom_scooter_dic)/len(e_scooter_df["device_id"].unique()),2))
print(round(len(running_scooter_dic)/len(e_scooter_df["device_id"].unique()),2))
0.03
0.97
Only three percent of all scooters have been decommissioned.
decom_scooters_lifetime = np.array(list(decom_scooter_dic.values()))
running_scooters_lifetime = np.array(list(running_scooter_dic.values()))
full_df = pd.DataFrame({"Running": running_scooters_lifetime})
decom_df = pd.DataFrame({"Decomm": decom_scooters_lifetime})
full_df["Decomm"] = decom_df["Decomm"]
full_df[full_df["Running"] > 24*3].boxplot()
plt.show()
This, of course, is the total lifetime in hours. It would be interesting to see the actual running time.
Computing the working time of the scooters and with it, the working distance:
start_times_dict, end_times_dict, time_duration_scooter_trips, length_duration_scooter_trips = utils.computing_scooter_events(e_scooter_df)
full_array_start_times = np.concatenate(tuple(values for values in start_times_dict.values()))
full_array_end_times = np.concatenate(tuple(values for values in end_times_dict.values()))
full_array_times = np.concatenate(tuple(values for values in time_duration_scooter_trips.values()))
full_array_lengths = np.concatenate(tuple(values for values in length_duration_scooter_trips.values()))
pd.Series(full_array_times).describe()
count    56154.000000
mean         0.275459
std          0.302160
min          0.001667
25%          0.101111
50%          0.183056
75%          0.336389
max          3.083333
dtype: float64
pd.Series(full_array_lengths).describe()
count    56154.000000
mean         5.509180
std          6.043203
min          0.033333
25%          2.022222
50%          3.661111
75%          6.727778
max         61.666667
dtype: float64
plt.hist(full_array_lengths, bins = 20)
plt.ylabel("Counts")
plt.xlabel("Distance (Km)")
plt.show()
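The choice of cutoff can be sanity-checked against the speed each trip implies (distance over duration): a 20 min ride at a typical e-scooter top speed of roughly 21 km/h covers about 7 km. A sketch (the helper is our own; on the real data it would take `full_array_lengths` and `full_array_times`):

```python
import numpy as np

def implied_speed_kmh(lengths_km, durations_h):
    """Trip speed implied by distance over duration.

    Porto's rental e-scooters are capped around 20-25 km/h (an assumption
    on our part), so trips far above that implied speed are probably GPS
    artefacts rather than real rides.
    """
    lengths = np.asarray(lengths_km, dtype=float)
    durations = np.asarray(durations_h, dtype=float)
    return lengths / durations

# Hypothetical usage on the arrays above:
# speeds = implied_speed_kmh(full_array_lengths, full_array_times)
# print(np.quantile(speeds, [0.5, 0.95, 0.99]))
```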
Let us assume the longest plausible ride lasts 20 min (about 0.33 h), which at that speed is still roughly 7 km, so we truncate the distances there:
truncated_array_lengths = np.array([ele for ele in full_array_lengths if ele < 7])
plt.hist(truncated_array_lengths, bins = 20, density = True)
plt.ylabel("Density")
plt.xlabel("Distance (Km)")
plt.show()
Interestingly, it resembles a Burr distribution.
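The resemblance could be checked by actually fitting a Burr XII distribution with SciPy. The notebook did not run such a fit, so the sketch below uses stand-in samples instead of `truncated_array_lengths`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for truncated_array_lengths: samples drawn from a Burr XII law
sample = stats.burr12.rvs(c=2.0, d=1.5, scale=3.0, size=2000, random_state=rng)

# Fit the shapes (c, d) and scale; fix loc=0 since distances are positive
c, d, loc, scale = stats.burr12.fit(sample, floc=0)

# A Kolmogorov-Smirnov test gives a rough goodness-of-fit check
ks = stats.kstest(sample, "burr12", args=(c, d, loc, scale))
print(f"c={c:.2f}, d={d:.2f}, scale={scale:.2f}, KS p-value={ks.pvalue:.3f}")
```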
Check how the trips are distributed over start and end times:
plt.hist(full_array_start_times, bins = 20, alpha = 0.4)
plt.hist(full_array_end_times, bins = 20, alpha = 0.2)
plt.legend(["Start","End"])
plt.show()
Key Conclusions
Where are the current e-scooter parks located?
escooter_parks_path = r"/NewFolder/Soft-Mobility/E-Scooter Parks"
escooter_file_name = "/e-scooter_parks.json"
with open(escooter_parks_path + escooter_file_name) as json_data:
data = json.load(json_data)
data.keys()
dict_keys(['displayFieldName', 'fieldAliases', 'geometryType', 'spatialReference', 'fields', 'features'])
fele_features = data['features'][:1]
print(fele_features)
[{'attributes': {'objectid_1': 199, 'toponimo': 'Rua de D. Pedro V', 'n_id': 204, 'estado': 'ATIVO', 'sinal_vert': 'NAO', 'sinal_info_vert': 'SIM', 'n_lugares': 10}, 'geometry': {'x': -8.630972406820455, 'y': 41.149252223614546}}]
feature_df = pd.DataFrame(data['features'])
feature_df.columns
Index(['attributes', 'geometry'], dtype='object')
feature_attributes_df = feature_df["attributes"].apply(pd.Series)
feature_geometry_df = feature_df["geometry"].apply(pd.Series)
escooter_parks_df = feature_attributes_df.join(feature_geometry_df, how = "left")
escooter_parks_df["estado"].value_counts()
ATIVO    216
Name: estado, dtype: int64
escooter_parks_df = escooter_parks_df.sort_values(by = "objectid_1").set_index("objectid_1").rename(columns = {"x":"lon","y":"lat"})
escooter_parks_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 216 entries, 1 to 1017
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   toponimo         216 non-null    object
 1   n_id             216 non-null    int64
 2   estado           216 non-null    object
 3   sinal_vert       56 non-null     object
 4   sinal_info_vert  216 non-null    object
 5   n_lugares        216 non-null    int64
 6   lon              216 non-null    float64
 7   lat              216 non-null    float64
dtypes: float64(2), int64(2), object(4)
memory usage: 15.2+ KB
escooter_parks_df[["estado", "sinal_vert","sinal_info_vert", "n_lugares"]].fillna("No info").value_counts()
## This is equivalent to performing a groupby
## 48 parks have no vertical sign but do have a vertical information sign
## Almost all parks hold 10 scooters each; only four parks hold 20
estado sinal_vert sinal_info_vert n_lugares
ATIVO No info SIM 10 158
NAO SIM 10 48
SIM SIM 10 6
NAO SIM 20 2
No info SIM 20 2
dtype: int64
park_positions_df_merc = utils.mercator_transform(escooter_parks_df)
Doing a quick clustering; a more principled choice of the number of clusters could be made using the silhouette score.
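The silhouette-based selection mentioned above could look like the following sketch (the helper is our own; on the real data it would receive `X_parks`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 11), random_state=0):
    """Pick the cluster count with the highest mean silhouette score."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

# Hypothetical usage:
# best_k, _ = best_k_by_silhouette(X_parks)
```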
X_parks = escooter_parks_df[["lon","lat"]].values
k_means = KMeans(n_clusters = 4)
y_park_pred = k_means.fit_predict(X_parks)
park_clust = k_means.cluster_centers_
positions_cluster_df_merc = geopd.GeoDataFrame(crs="EPSG:4326", geometry = geopd.points_from_xy(x=park_clust[:,0], y=park_clust[:,1]))\
.to_crs("EPSG:3857")
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
drop_off_positions_df_merc.plot(color = "r", ax = ax, alpha = 0.25)
positions_cluster_df_merc.plot(ax = ax, c = "g", marker = "^", markersize = 500)
park_positions_df_merc.plot(ax = ax, c = "b")
plt.legend(["Drop-off","Cluster","Parks"])
plt.title("E-scooter Parks")
fig.savefig("Outputs/EScooterParks.png", dpi = 200)
plt.show()
Note that the provider often dropped scooters off outside of the designated parks, which is rather strange.
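This observation could be quantified as the distance from each drop-off to its nearest park. A haversine sketch (the helper is our own; the commented usage lines are hypothetical):

```python
import numpy as np

def nearest_park_km(drop_lat, drop_lon, park_lat, park_lon):
    """Haversine distance (km) from each drop-off to its closest park."""
    r = 6371.0  # mean Earth radius in km
    la1, lo1 = np.radians(drop_lat)[:, None], np.radians(drop_lon)[:, None]
    la2, lo2 = np.radians(park_lat)[None, :], np.radians(park_lon)[None, :]
    a = (np.sin((la2 - la1) / 2) ** 2
         + np.cos(la1) * np.cos(la2) * np.sin((lo2 - lo1) / 2) ** 2)
    dists = 2 * r * np.arcsin(np.sqrt(a))
    return dists.min(axis=1)

# Hypothetical usage:
# d = nearest_park_km(drop_off_df["lat"].values, drop_off_df["lon"].values,
#                     escooter_parks_df["lat"].values, escooter_parks_df["lon"].values)
# print((d > 0.05).mean())  # fraction of drop-offs more than 50 m from any park
```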
final_amenities_red_df = amenities_red_df[["lat","lon","name","tags"]]
final_bus_stops_df = busiest_bus_stops_df.reset_index()[["stop_lat","stop_lon", "Average Num Rides p Hour", "stop_name"]]
final_metro_stops_df = metroest_stops_df.reset_index()[["stop_lat","stop_lon", "Average Num Rides p Hour", "stop_name"]]
final_transport_stops_df = pd.concat([final_bus_stops_df, final_metro_stops_df], ignore_index = True)\
.rename(columns = {"stop_lat":"lat","stop_lon":"lon", "stop_name":"name"})
final_transport_stops_df["tags"] = "stop"
final_transport_stops_df["Average Num Rides Normalized"] = utils.min_max_scaler(final_transport_stops_df["Average Num Rides p Hour"])
final_comb_amenities_df = pd.concat([final_amenities_red_df,final_transport_stops_df.drop(columns = ["Average Num Rides p Hour"])]\
, ignore_index = True)
pop_data_df = geopd.read_file("/NewFolderPop/BGRI2021_1312.gpkg")
polygons = [x for x in pop_data_df['geometry'].to_crs('epsg:4326')]
s = geopd.GeoSeries(polygons)
porto_area = s.unary_union
xmin, ymin, xmax, ymax= pop_data_df.to_crs('epsg:4326').total_bounds
n_cells = 75
cell_size = (xmax-xmin)/n_cells
crs = "+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs"
grid_cells = []
for x0 in np.arange(xmin, xmax+cell_size, cell_size ):
for y0 in np.arange(ymin, ymax+cell_size, cell_size):
# bounds
x1 = x0-cell_size
y1 = y0+cell_size
grid_cells.append(sh.geometry.box(x0, y0, x1, y1))
cell = geopd.GeoDataFrame(grid_cells, columns=['geometry'], crs = 'epsg:4326')
points = cell.centroid
/tmp/ipykernel_4674/2856847100.py:14: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
  points = cell.centroid
points_df = geopd.GeoDataFrame(geometry=geopd.GeoSeries(points))#.to_crs('epsg:4326')
points_df['in_porto'] = points_df['geometry'].within(porto_area)
points_in_porto_df = points_df[points_df.in_porto == True].copy()
points_in_porto_df['lon'] = points_in_porto_df['geometry'].x
points_in_porto_df['lat'] = points_in_porto_df['geometry'].y
points_in_porto_df['population'] = points_in_porto_df.apply(lambda x: utils.get_region(x['geometry']\
, pop_data_df.to_crs('epsg:4326')), axis = 1)
Generating an equally spaced grid of points.
fig, ax = plt.subplots(1, figsize=(16, 16))
ax.imshow(porto_img, extent=porto_ext)
points_in_porto_merc_df = utils.mercator_transform(points_in_porto_df)
points_in_porto_merc_df.plot(ax = ax, alpha=0.2, color = 'red', marker = 'o')
plt.show()
Note that some points are in the middle of the river, which would need to be solved.
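One way to solve this would be to clip the grid against a water polygon (e.g. the Douro fetched from OSM) before the `within` test. A toy, geometry-only sketch with stand-in polygons:

```python
from shapely.geometry import Point, box

# Stand-ins for the Porto polygon and a river polygon fetched from OSM
land = box(0, 0, 10, 10)
river = box(0, 4, 10, 5)
usable = land.difference(river)  # land with the river strip removed

points = [Point(1, 1), Point(1, 4.5), Point(9, 9)]
# Keep only grid points that fall on land but not in the river
on_land = [p for p in points if p.within(usable)]
```

On the real grid, `porto_area` would be replaced by `porto_area.difference(river_polygon)` before the `within` filter.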
Initial clustering
kmeans_grid = KMeans(n_clusters = 210)
kmeans_grid.fit(points_in_porto_df[['lat','lon']])
centroids_grid = pd.DataFrame(kmeans_grid.cluster_centers_, columns = ['lat', 'lon'])
fig, ax = plt.subplots(1, figsize=(16, 16))
ax.imshow(porto_img, extent=porto_ext)
centroids_grid_merc = utils.mercator_transform(centroids_grid)
centroids_grid_merc.plot(ax = ax)
plt.title("Centroids in Porto")
plt.show()
amenities_relative_weigh_dict = {'bar':1, 'fast_food':2,'theatre':2,'cinema': 2\
, 'stop' : 3, 'school': 4,'hospital': 5, 'library': 5 \
, 'college': 5, 'university': 8}
points_in_porto_distances_df = points_in_porto_df.copy(deep = True)
points_in_porto_distances_df['amenities_score'] = points_in_porto_distances_df\
.apply(lambda x: utils.amenity_score_calc(x['lon'], x['lat'], final_comb_amenities_df, amenities_relative_weigh_dict), axis = 1)
points_in_porto_distances_df['amenities_score'].plot(kind = 'hist')
plt.xlabel("Amenities Score")
plt.show()
q_list = [0.,0.05,0.5,0.75,0.95,1.]
quantile_list = [points_in_porto_distances_df['amenities_score'].quantile(ele) for ele in q_list]
quantile_int_labels = [ele for ele in range(len(quantile_list)-1,0,-1)]
points_in_porto_distances_df["amenity_quantile_interval"] = pd.cut(points_in_porto_distances_df['amenities_score'], \
quantile_list, include_lowest=True, \
labels = quantile_int_labels)
points_in_porto_merc_df["amenity_quantile_interval"] = points_in_porto_distances_df["amenity_quantile_interval"]
cmap = plt.get_cmap('Dark2', len(quantile_list))
grouped = points_in_porto_merc_df.groupby("amenity_quantile_interval")
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
for key, group in grouped:
group.plot(color = cmap(key), ax = ax, alpha = 0.8)
plt.legend(quantile_int_labels)
plt.title("Amenity Quantile Score")
fig.savefig("Outputs/Amenity_Quantile_Score.png", dpi = 200)
plt.show()
points_in_porto_distances_df["amenity_normalized"] = 100*utils.min_max_scaler(points_in_porto_distances_df["amenities_score"])
points_in_porto_distances_df["population_normalized"] = 100*utils.min_max_scaler(points_in_porto_distances_df["population"])
points_in_porto_distances_df["final weight"] = 1*points_in_porto_distances_df["population_normalized"] + \
10*points_in_porto_distances_df["amenity_normalized"]
points_in_porto_distances_df["final weight"].hist()
plt.show()
kmeans_grid_weighted = KMeans(n_clusters = 210)
kmeans_grid_weighted.fit(points_in_porto_distances_df[['lat','lon']],
                         sample_weight = points_in_porto_distances_df["final weight"])
centroids_grid_weighted = pd.DataFrame(kmeans_grid_weighted.cluster_centers_, columns = ['lat', 'lon'])
centroids_grid_weighted_merc = utils.mercator_transform(centroids_grid_weighted)
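A note on weighted K-Means: scikit-learn's `KMeans.fit` accepts per-sample weights only via the keyword argument `sample_weight`; its second positional argument is the ignored target `y`, so weights passed positionally are silently dropped. A toy demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two groups on a line; the right-hand group carries 10x the weight,
# so the single centroid is pulled toward it.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
w = np.array([1.0, 1.0, 10.0, 10.0])

km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X, sample_weight=w)
print(km.cluster_centers_)  # weighted mean: (0 + 1 + 100 + 110) / 22
```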
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
centroids_grid_weighted_merc.plot(ax = ax)
centroids_grid_merc.plot(ax = ax, marker = "^", color = 'red')
park_positions_df_merc.plot(ax = ax, marker = "v", c = "y")
plt.legend(["Weighted Centroids","Original Centroids","E-scooter parks"])
fig.savefig("Outputs/Model_1_Results.png", dpi = 200)
plt.title("Model 1 Results")
plt.show()
Comparing the amenity score between the new scooter parks and the old ones:
centroids_grid_weighted['amenities_score'] = centroids_grid_weighted\
.apply(lambda x: utils.amenity_score_calc(x['lon'], x['lat'], \
final_comb_amenities_df, amenities_relative_weigh_dict), axis = 1)
centroids_grid_weighted['amenities_score'].describe()
count     210.000000
mean      759.167938
std       339.391086
min       164.608367
25%       544.632340
50%       695.914696
75%       920.340482
max      1932.669810
Name: amenities_score, dtype: float64
escooter_parks_df['amenities_score'] = escooter_parks_df\
.apply(lambda x: utils.amenity_score_calc(x['lon'], x['lat'], final_comb_amenities_df, amenities_relative_weigh_dict), axis = 1)
escooter_parks_df['amenities_score'].describe()
count     216.000000
mean     1027.821111
std       422.029965
min       320.826131
25%       717.713035
50%       920.750610
75%      1349.298286
max      2015.402688
Name: amenities_score, dtype: float64
The original parks perform better - regarding our amenities score - than the new parks.
choice_of_input_df = busiest_bus_stops_df
stop_amen_importance_dic = utils.stop_amenity_weighter(choice_of_input_df, amenities_red_df)
stop_importance_dic = {stop: sum(values) for stop, values in stop_amen_importance_dic.items()}
stop_importance_series = pd.Series(stop_importance_dic, name = "Stop Importance")
num_rides_split_1_perc_df = choice_of_input_df.join(stop_importance_series)
Determining stop density - penalty
individual_stop_density_dic = utils.stop_density(num_rides_split_1_perc_df)
total_stop_density_dic = {stop: sum(values) for stop, values in individual_stop_density_dic.items()}
stop_density_series = pd.Series(total_stop_density_dic, name = "Stop Density")
num_rides_split_1_perc_df = num_rides_split_1_perc_df.join(stop_density_series)
num_rides_imp_sorted = num_rides_split_1_perc_df.sort_values(by = "Stop Importance", ascending = False)
Let us normalize the three weights for convenience:
original_col_list = ["Average Num Rides p Hour", "Stop Importance", "Stop Density"]
for i, column in enumerate(["Average Num Rides p Hour Normalized", "Stop Importance Normalized", "Stop Density Normalized"]):
num_rides_split_1_perc_df[column] = utils.min_max_scaler(num_rides_split_1_perc_df[original_col_list[i]])
model_input_df = num_rides_split_1_perc_df[["stop_id","stop_name","stop_lat", "stop_lon","zone_id"] +\
[col for col in num_rides_split_1_perc_df.columns if "Normalized" in col]]
final_score = utils.min_max_scaler(model_input_df["Average Num Rides p Hour Normalized"] + \
0.5*model_input_df["Stop Importance Normalized"] - \
0.5*model_input_df["Stop Density Normalized"])
k_means_base = KMeans(n_clusters = 216)
k_means_base.fit(model_input_df[['stop_lat','stop_lon']], sample_weight = final_score)
base_centroids_weighted = pd.DataFrame(k_means_base.cluster_centers_, columns = ['lat', 'lon'])
base_centroids_weighted_merc = utils.mercator_transform(base_centroids_weighted)
fig, ax = plt.subplots(1, figsize=(16, 16), facecolor = (1,1,1))
ax.imshow(porto_img, extent=porto_ext)
bus_stop_positions_df_merc.plot(ax = ax,c = "r", markersize = 50, alpha = 0.5)
base_centroids_weighted_merc.plot(ax = ax,c = "g", marker = "^", markersize = 120, alpha = 0.75)
park_positions_df_merc.plot(ax = ax, c = "b")
plt.legend(["Bus stops","Baseline Clustering Stops", "E-scooter parks"])
plt.title("Model 2 Results")
fig.savefig("Outputs/Model_2_Results.png", dpi = 200)
plt.show()
base_centroids_weighted['amenities_score'] = base_centroids_weighted\
.apply(lambda x: utils.amenity_score_calc(x['lon'], x['lat'], final_comb_amenities_df, amenities_relative_weigh_dict), axis = 1)
base_centroids_weighted['amenities_score'].describe()
count     216.000000
mean      853.031478
std       347.316590
min       238.412417
25%       619.896386
50%       778.824406
75%      1034.819403
max      2067.518548
Name: amenities_score, dtype: float64
escooter_parks_df['amenities_score'].describe()
count     216.000000
mean     1026.947382
std       420.972203
min       320.826131
25%       717.713035
50%       919.434682
75%      1348.253305
max      2012.177295
Name: amenities_score, dtype: float64
Although model 2 performs better than model 1, it still underperforms the current setting.
Copy here the most important visualizations (graphs, charts, maps, images, etc). You can refer to them in the Executive Summary.
Technical note: If not all the visualisations are visible, you can still include them as an image or link - in this case please upload them to your own repository.
List all of the external links (even if they are already linked above), such as external datasets, papers, blog posts, code repositories and any other materials.
Census data - Portuguese National Institute of Statistics
Amenity data from Open Street Map